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An efficient MCMC algorithm is presented to cluster the nodes of a network 

such that nodes with similar role in the network are clustered together. This 

is known as block-modelling or block- clustering. The model is the stochas- 

d tic blockmodel (SBM) with block parameters integrated out. The resulting 

CZ2 marginal distribution defines a posterior over the number of clusters and 

cluster memberships. Sampling from this posterior is simpler than from the 

"^ original SBM as transdimensional MCMC can be avoided. The algorithm 

f-r^ is based on the allocation sampler. It requires a prior to be placed on the 

OO number of clusters, thereby allowing the number of clusters to be directly 

2^ estimated by the algorithm, rather than being given as an input parameter. 

_^ Synthetic and real data are used to test the speed and accuracy of the model 

O and algorithm, including the ability to estimate the number of clusters. The 

^ algorithm can scale to networks with up to ten thousand nodes and tens of 

• • millions of edges. 

^> 

k> Keywords: Clustering, Social networks, Blockmodelling, Computational 

^ Statistics, MCMC. 

1. Introduction 

This paper is concerned with block-modelling - an approach to clustering 
the nodes in a network, based on the pattern of inter-connections between 
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them. The starting point for the method presented here is the stochastic 



block model (SBM) Nowicki and Snijders (2001). The goal is to improve the 



speed and scalabihty, without compromising on accuracy. We use conjugate 
priors and integration in order to focus on the marginal distribution of inter- 
est, this marginalization is also referred to as the 'collapsing' of the nuisance 



parameters (Liu, 1994 Wyse and Friel 2012). This allows us to implement 



an efficient algorithm based on the allocation sampler of Nobile and Fearn 



side (2007). We incorporate existing extensions, such as the weighted-edge 



model of Mariadassou et al. (2010), and show how this extension can be in- 



corporated within our collapsing and within our algorithm. As required by 
the allocation sampler, we place a prior on the number of clusters, allowing 
the number of clusters to be directly estimated. Together, these techniques 
allow us to avoid the more complex forms of transdimensional MCMC and 
they also allow us to avoid the need for post-hoc model selection via criteria 
such as the ICL. 

We show that our method can accurately and efficiently estimate the num- 
ber of clusters - an improvement over many existing methods. Our algorithm, 
and the data we have used in section 6A and our survey data used in section [7} 
are available at http://sites.google.com/site/aaronmcdaid/sbni. 

The concept of clustering is broad and originated outside of network anal- 
ysis, where the input data is in the form of real- valued vectors describing the 
location of the data points in a Euclidean space. Network clustering takes a 
set of connected nodes as input and finds a partition of the nodes based on 
the network structure. This finds application in many different contexts. For 
instance, in bio-informatics, networks of protein-protein interactions are anal- 
ysed and clustering is applied to find functional groups of proteins. Interest 
in social network analysis has grown greatly in recent years, with the avail- 
ability of many networks, such as Facebook datasets, of human interactions. 
Clustering of such social networks has been applied in order to find social 
communities. In the following, we will distinguish the community-finding 
problem from the more general setting of block-modelling. 

In network analysis, the input data may be described mathematically 
as a graph, which is a set of nodes (where each node represents an entity, 
say, a person) and a set of edges linking pairs of nodes together. An edge 
might represent a friendship on Facebook or a phone call on a mobile phone 
network. In section [7| we apply our method to the network of interactions 
between participants at a summer school. 

Given a network, the goal in block-modelling is to cluster the nodes such 



that pairs of nodes are clustered together if their connectivity pattern to the 
clusters in the rest of the network is similar. A cluster might, for exam- 
ple, consist of a set of nodes which do not tend to have connections among 
themselves at all. Given two nodes in this cluster (node i and node j), the 
neighbours of i tend to be in the same clusters as are the neighbours of 
j. Community-finding has focussed on finding clusters of high internal edge 
density, where an edge between two nodes will tend to pull the two nodes 
into the same cluster, and a non-edge will tend to push them into separate 
clusters. This contrasts with block-modelling, which allows clusters to have 
low internal edge density. Block-modelling is able to find such community 
structure, but it is a more general method that is also able to find other types 
of structure. 

A variety of other, non-probabilistic, approaches have been used to tackle 
the broad problem of block-modelling (Everett, 1996; Chan et al. , 2011). 



Outside of block-modelling, there are other solutions for community-finding 



man 



in networks (Newman and Girvan 2004; Girvan and Newman 2002 New- 



2004). Many probabilistic clustering models have also been applied 



(Handcock et al., 2007 Hoff et al. 2002 Airoldi et al. 2008). 



There is a huge variety of methods, and we will not attempt to summarize 
them further; for the rest of this paper, we will focus on the SBM and on 
algorithms for the SBM. For more details, in particular about community 



finding, see the excellent review article of Fortunato (2010). 



The remainder of the paper is structured as follows. In section |2} we 
define the SBM and define the notation used in the paper. We then define, 
in section [3| the conjugate priors and integration that we use in order to 
access the relevant marginal distribution. Section |4] discusses other closely- 
related models and algorithms and in particular gives consideration to the 
issue of how to select the number of clusters (model selection), comparing the 
approach we have used to other approaches and noting connections among 
the methods. 

Section [5] describes the algorithm we use; without collapsing, it would 



have been necessary to use full Reversible Jump MCMC (Green (1995)) to 
search a sample space of varying dimension and this could be much slower. 
In section [6} we evaluate our method on synthetic networks, showing how 
the number of clusters can be estimated accurately and the nodes assigned 
to their correct cluster with high probability. We also test the scalability 
and efficiency of the algorithm by considering synthetic datasets with ten 
thousand nodes and ten million edges. 



In section [7| we evaluate our method on a dataset of interactions, gath- 
ered by a survey, of participants at a doctoral summer school attended by 
one of the authors of this paper. The method is able to detect interest- 
ing structures, demonstrating the differences between block-modelling and 
community-finding. Section IS] draws some conclusions. 



2. Stochastic Block Model(SBM) 



As formulated in Nowicki and Snijders (2001), a network describes a 



relational structure on a set of nodes. Each edge in the network describes a 
relationship between the two nodes it links. A general case of a finite alphabet 
of states relating a pair of nodes is considered but in the simplest case. 



discussed by the same authors in Snijders and Nowicki (1997), relationships 
are binary - an edge joining a pair of nodes either exists or not. The network 
can be undirected, corresponding to symmetric relationships between the 
nodes, or may be directed, where a relationship from node i to node j does 
not necessarily imply the same relationship exists from node j to node i. 
Finally, a self-loop - a relationship from node to itself - may or may not be 
allowed. 

Throughout the paper, we use P(-) to refer to probability mass (i.e. of 
discrete quantities) and p(-) to refer to probability density (i.e. of continuous 
quantities). N is the number of nodes in the network and K is the number 



of clusters. In the algorithm proposed in Nowicki and Snijders (2001), these 



are given input values, although in our approach, we treat K as a random 
variable with a given prior distribution. Given A^ and K, the SBM describes 
a random process for assigning the nodes to clusters and then generating a 
network. Specifically, the cluster memberships are represented by a random 
vector z of length A^ such that Zi G {1, . . . , K} records the cluster containing 
node i. Zi follows a multinomial distribution. 



lid 



Multinomial(l; 9i, ... ,9k) 



such that 9i is the probability of a node being assigned to cluster z (1 = 
Ylk=i^k)- The vector 9 is itself a random variable drawn from a Dirich- 



let prior with dimension K. The parameter to the Dirichlet is a vector 
(ai, . . . ,aK) of length K. We follow Nowicki and Snijders (2001) by fixing 
the components of this vector to a single value a, and by default a = 1, 



9 ~ Dirichlet («! = a, ^2 



a, 



.aK 



a] 



This describes fully how the A^ nodes are assigned to the K clusters. Next 
we describe how, given this clustering z, the edges are added between the 
nodes. 

A network can be represented as an A^ x A^ adjacency matrix, x, such 
that Xij represents the relation between node i and node j (taking values 
1 or in the binary case). Denote by X(^ki) the submatrix corresponding to 
the block of connections between nodes in cluster k and nodes in cluster /. 
If the network is undirected, there are \K[K + 1) blocks, corresponding to 
each pair of clusters; and if the network is directed, there are K^ clusters, 
corresponding to each ordered pair of clusters. 

It is generally simpler to discuss the directed model; unless otherwise 
stated, the formulae presented here apply only to the directed case. The def- 
initions and derivations can easily be applied to the undirected case, provided 
that care is taken only to consider each pair of nodes exactly once. 

If self-loops are not allowed, then the diagonal entries of x, Xa, are ex- 
cluded from the model. It is assumed that, given K and z, connections are 
formed independently within a block so that 

P{x\z, K, vr) = JJ P(x(H) k, K, TTki) , 

k.l 

where 

{i\zi=k} {j\zj=l} 

and the matrix vr = {ttm} describes the cluster-cluster interactions, vr is a 
K X K matrix, but for undirected networks only the diagonal and upper 
triangle are relevant. Specifically, for binary networks, Tr^/ represents the 
edge density within the block, and edges follow the Bernoulli distribution, 

Xij\z, K,T[ r^ Bernouni(7r2^2j . 

Each of the Tiki is drawn from the conjugate Beta(/3i,/32) prior, 

TTki ~ Beta(/3i,/32). 



Again we follow Nowicki and Snijders (2001) and choose /3i = /32 = 1, giving 
a Uniform prior. 

This completes the description of the Bayesian presentation of the SBM. 



A different approach is taken in other work, such as that of Daudin et al. 



(2008), where, using essentially the same model, the goal is to take a 
point estimate of the parameters, [71,6), for a given number of clusters K. 
Specifically, the aim is to find the MLE; the value of [if, 6) which maximizes 
P(x|7r, 6', K). This is described as the frequentist approach, in contrast to the 
fully Bayesian approach where a distribution of parameter values is allowed 
instead of a point estimate. We will return to this issue in a little more detail 
in section |4] in order to discuss the practical differences from an algorithmic 
point of view. 

2.1. Data model variations 



The model is naturally extended in Nowicki and Snijders (2001) to allow 



for a finite alphabet of two or more relational states, where instead of using 
a Bernoulli with a Beta prior for x and vr, we can use a Multinomial and a 
Dirichlet to model this alphabet. The Bernoulli- and-Beta-prior model is just 
a special case of the Multinomial-and-Dirichlet-prior model. Alternatively, 
we can allow an infinite support and extend the model to allow for non- 
negative integer weights on the edges, by placing a Poisson distribution on 
as seen in Mariadassou et al. (2010). Now iiki represents the edge 



P{X\'K,Z), 



rate and is drawn from a Gamma prior, 

Xij \z, K,TT r^ Poisson(7r2 



TTfci 



Gamma(s, 



We do not suggest any default for the hyperparameters s and 0. A further 
extension to real-valued weights is also possible, by using a Gaussian for 
p{x\tt,z) and suitable prior on vr, following Wyse and Friel (2012). These 



variations, and others, are described in Mariadassou et al. (2010), but they 
do not discuss conjugate priors. 

The integration approach and algorithm described later in this paper can 
be applied to many variants of edge model, however we focus in the remainder 
of the paper on the Bernoulli and Poisson models that are supported in our 
software. 

In summary, given A^ and K the random process generates 6, z, it and ul- 
timately the network x. The two main variables of interest are the clustering 
z and the network x. In a typical application, we have observed a network x 
and perhaps we have an estimate of K, and our goal is to estimate z. 



3. Collapsing the SBM 



In this section, we show how collapsing can be used to give a more con- 
venient and efficient expression for the model. This refers to the integration 



of nuisance parameters from the model, see Wyse and Friel (2012) for an 



application to a different, but related, bipartite model. The SBM has been 



partially collapsed by Kemp et al. (2004), but we will consider the full col 



lapsing of both vr and 9. As our primary interest is in the clustering z and 
the number of clusters K, we integrate out vr and 6, yielding an explicit 
expression for the marginal P{x,z,K). We emphasize that integration does 
not change the model, it merely yields a more convenient representation of 
the relevant parts of the posterior. This integration is made possible by the 
choice of conjugate priors for vr and 6. We treat i^ as a random variable and 
place a Poisson prior on K with rate A = 1, conditioning on i^ > 0, 



K r^ Poisson{l) \ K > 0, 



(1) 



which gives us 



P{K) 



K\' 



1 



1-P(i^ = 0) K\{e-l)' 

We are only interested in these expressions as functions of K and z up to 
proportionality, as this will be sufficient for our Markov Chain over {K, z\x), 
and hence we can simply use P{K) oc ■^^. 

The Poisson prior is used in the allocation sampler, the algorithm upon 
which our method is based (Nobile and Fearnside, 2007). This allows the 



estimation of the number of clusters as an output of the model rather than 
requiring a user to specify K as an input or to to use a more complex form 
of model selection. Thus, we have a fully Bayesian approach where, other 
than A^, which is taken as given, every other quantity is a random variable 
with specified priors where necessary. 



p(a;, TT, z, ^, AT) = P{K) x v{z.O\K) 

X p(x, 7r|z) . 



(2) 



With eq. ([2]) we could create an algorithm which, given a network x, 
would allow us to sample the posterior vr, z, 6', K\x. However, we are only 
interested in estimates of z, K\x. We now show how to collapse tt and 9. 

Define IR+ to be the set of non-negative real numbers, and write the set 
of real numbers between and 1 as [0, 1]. Define G the unit simplex i.e. the 



subset of M;^ where 1 = X]fc=i ^k- Define 11 to be the domain of vr. For the 
Poisson model this is MJ^ while for the Bernoulh model this is [0, l]'^, where 
B is the number of blocks. 

We can access the same posterior for z and K by collapsing two of the 
factors in eq. (|2]), 

V{x,z,K) = V{K)^ fp{z,e\K)d9 

f (3) 

X / p(x,7r|z) dvr , 



or, equivalently, using the block-by-block independence X(^ki)\z,K, 

F{x,z,K) = F{K)x fp{z,e\K)d9 

^« (4) 

X 



JJ / P{X{kl),TTkl\z) dlTM- 



k,l -"^'kl 

This allows the creation of an algorithm which searches only over K and 
z. The algorithm never needs to concern itself with 6 or tt. 

Collapsing greatly simplifies the sample space over which the MCMC 
algorithm has to search. Without collapsing, the dimensionality of the sam- 
ple space would change if our estimate of K changed; this would require 
a Reversible- Jump Markov Chain Monte Carlo (RJMCMC) algorithm (see 



Green (1995)). Finally, if estimates for the full posterior, including tt and 
9, are required, it should be noted that it is very easy to sample vr, 9\x, z, K, 
meaning that nothing is lost by the use of collapsing. Many of the other 
models described in section |4] are collapsible, and this may be an avenue for 
future research. 

The integration of eq. ^ allows an expression for the full posterior dis- 
tribution to be obtained. Details of the derivation of this expression are 
given in Appendix A. Let n^ be the number of nodes in cluster k. n^ is a 
function of z. For the Bernoulli model, let i/ki be the number of edges that 
exist in block kl, i.e. the block between clusters k and /. For the Poisson 
model, yki is the total edge weight, y is a function of x and z. Let pki be 
the maximum number of edges that can be formed between clusters k and /. 
For off-diagonal blocks, pki = nkUi. For diagonal blocks, pkk depends on the 
form of the network as follows, 



Pkk 



2nk[nk - I) 
\nk{nk + 1) 
rikiuk - 1) 
nl 



undirected, no self-loops 
undirected, self-loops 
directed, no self-loops 
directed, self-loops 



The full posterior may be written as 



(5) 



P(x, z, K) oc 



K\ 



tK 



T{aK)Ut=innk + a) 



(6) 



where the final product is understood to take place over all blocks. The form 
of the function f{x(^ki)\z) depends on the edge model. If Bernoulli, then 



f{X(kl)\z) 



B(/3i + ykhPki -yki + 1^2) 



B(/3i,/32) 
where Bi., .) is the Beta function. If Poisson, then 



(7) 



^{.s + Vki) 



f{x{ki)\z) 



Pki- 



s+Vkl 



Tis) 



(8) 



4. Related estimation procedures for the SBM 

Before defining our algorithm, we look at related work, particularly other 
methods that are based on the SBM. We will focus on models which are 
identical, or very similar to, the SBM. Therefore, we will not discuss other 



models which are loosely related, such as that of Newman and Leicht (2007), 
or the "degree-corrected" SBM of Karrer and Newman (2011). 



All methods discussed here are aimed at estimating z, but they differ in 
the approach they take to the parameters tt and 6 and in whether they allow 
the number of clusters, K, to be estimated. We also discuss the issue of model 
selection, i.e. how the various methods estimate the number of clusters. This 



question was avoided in the original paper of Nowicki and Snijders (2001) 
where the number of clusters is fixed to if = 2 in the evaluation. 



9 



The method of Daudin et al. (2008) takes a network, x, and number of 
clusters K, and apphes a variational algorithm. Point estimates are used 
for 71 and 6, but the clustering z is represented as a distribution of possible 
cluster assignments for each node. This makes the method analogous to 
the EM algorithm for the MLE - finding the pair (tt, 6) which maximizes 
F{x\TT,e,K). 



The model used by Zanghi et al. (2008 ) is a subset of the model of Daudin 



et al. (2008). The cluster-cluster density matrix, tt, is simplified such that it 



is represented by two parameters A and e, such that the on-diagonal blocks 
TTfcfc = A and the off-diagonal blocks tt^^ = e (for k ^ I). A Classification EM 
(CEM) algorithm to maximize argmax P(x, z\7t, 6, K) is briefly described in 



Zanghi et al. (2008) but not implemented. Instead, they implement an online 



algorithm. One node of the network is considered at a time and is assigned 
to the cluster which maximizes P(a;, z\ti,6,K), updating estimates of vr and 
9 with each addition. Implicitly, their goal is to use point estimates both for 
the parameters and for the clustering, to find (5, -n", 9) that would maximize 



'P[x,z\ti,9,K)] as such, it is loosely related to the profile likelihood ( [Bickel 
andChen||20fej ). 



The methods just discussed are based, directly or indirectly, on the fre- 
quentist approach of finding the maximum likelihood estimate of the param- 
eters, (vr, ^), i.e. the values vr, ^ that would maximize the likelihood of the 
observed network, 

P(x|7r, 9,K) = Y^ P(x, z\ti, 9, K) . 



The estimate of z that is used in this frequentist approach is the conditional 
distribution of z based on this point estimate of the parameters and on the 
observed network, z\x,t^,9,K. In practice though, it is not tractable to 
calculate or maximize this likelihood exactly, and hence a variety of different 
approximations and heuristics have been used. 

In a Bayesian method, such as ours, a distribution of estimates for (vr, 9) 
is used instead of a point estimate. The goal is to directly sample from 
z\x,K. Another example of this Bayesian approach is the variational algo- 



rithm used in Hofman and Wiggins (2008), which is based on the simpler A 



and e parameterization of the tt matrix used in Zanghi et al. (2008). 



The modelling choices of Latouche et al. (2012), where a new model 



selection criterion called ILvB is introduced, are essentially identical to the 
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standard SBM; each element of tt^i is independent, and conjugate priors are 
specified. A variety of other variational approximations are considered by 
Gazal et al. (2011), where there is more focus on parameter estimation and 



less focus on model selection. 

A further specialization of this model is possible, by employing the A, e 
parameterization, but where A > e, which explicity constrains the expected 
edge density within clusters to be larger than the expected edge density 
between clusters. This can be considered to be community-finding as opposed 



to block-modelling. The authors of this paper considered this in McDaid et al. 



(2012). 



4.I. Model selection 

Later, in our experiments in section |6| we will demonstrate the ability of 
the allocation sampler to accurately estimate the number of clusters. In this 
subsection, we will briefly discuss some of the theoretical issues around the 
estimation of the number of clusters. 

The methods that involve the MLE for the parameters involve the risk of 
overfitting; for larger values of K, the parameter space of n and 6 becomes 
much larger and therefore the estimates of P e=emiei^\-^) ^^^^ become over- 



optimistic, and will tend to overestimate K (Schwarz, 1978). Therefore, an 



alternative formulation such as the ICL is needed; see Zanghi et al. (2008 ) and 



Daudin et al. (2008 ) for derivations of the ICL in the context of models based 



on the SBM. Instead of using the MLE directly, those measures apply priors 



to the parameters and integrate over the priors, as described in Biernacki 



et al. (2000), such that the average likelihood is used instead of the maximum 



likelihood. 

Typically, such integrations cannot be performed exactly and the ICL 
criterion consists of approximations that are based on first finding an estimate 
to the MLE, and then adding correction terms to this MLE. For the rest of 
this subsection, we will not consider those approximate methods and will 
instead consider the exact solutions to the integrations. 

The integrated classification likelihood, which the ICL intends to approx- 
imate, 



P{x,z\K)= f fp{x,z,7i,e\K)d7rd7i, 



can be solved exactly in some models. The SBM is one of those models, and 
the posterior mass that our algorithm samples from is exactly equal to the 
integrated classification likelihood (if a uniform prior is used for K instead 
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of the default Poisson) . While it is easy to exactly calculate the integrated 
classification likelihood for a given [z, K), it would not be tractable to search 
across all possible [z, K) to find the state that maximizes the integrated 
classification likelihood, except for the smallest of networks. 

The BIG is an attempt to approximate the integrated likelihood 



P{x\K) = J2 P{x,z,7i,e\K)d7rde. 



An exact solution to the BIG is not tractable for the SBM; the likelihood 
would require a summation over all possible clusterings z. 

If we were to use a uniform prior for K, then P(a;|i^) = P{K\x) and 
an irreducible ergodic Markov chain algorithm such as ours would visit each 
value of K in proportion to the integrated likelihood for that value of K. Of 
course, our algorithm only gives a sample from the true posterior, and there 
cannot be any guarantee that the distribution of the sample is representative 
of the true distribution. 

The purpose of these last few paragraphs is to demonstrate that there 
are other (approximate) ways to calculate the integrated likelihood and the 
integrated classification likelihood. The Bayesian methods provide approxi- 
mations that may, in practice, be at least as good as the approximations that 
would be provided by methods such as the IGL. 



The model-selection criterion ILvb Latouche et al. (2012) is based on 
a variational approximation to a fully Bayesian model. As a result of its 
Bayesian model, it is an approximation of the integrated likelihood and no 
further adjustment is required for model selection. As with any variational 
Bayes method, we assume that the independence assumptions within the 
variational approximation are a good approximation of the true posterior. 
A second assumption made by those authors is that the KuUback-Leibler 
divergence, the difference between the true posterior and the variational ap- 
proximation, is independent of K. If these two assumptions hold, then the 
measure they use, which they call the ILvB, is equivalent to P{x\K), the 
integrated likelihood. To select the number of clusters, they use that value of 
K which maximizes the ILvB. 

5. Estimation 

In this section, we describe our MGMG algorithm which samples, given 
a network x, from the posterior K.,z\x. The moves are Metropolis-Hastings 

12 



moves (Hastings, 1970). We define the moves and calculate the proposal 



probabilities and close the section with a discussion of the label-switching 



phenomenon, where we use the method proposed in Nobile and Fearnside 



(2007) to summarize the clusterings found by the sampler. 



Our algorithm is closely based on the allocation sampler Nobile and 



Fearnside ( 2007 ) , which was originally presented in the context of a mixture- 
of-Gaussians model. In fact, it can be applied to any model that can be 
collapsed to the form P{x,z,K) where x is some fixed (observed) data and 
the goal is to sample the clustering and the number of clusters {z, K). 



In the Gibbs sampler used in Nowicki and Snijders (2001 ), the parameters 



are not collapsed, and sampling is from 

Z, TT, 9\x, K. 

In their experiment on the Hansell dataset, K was fixed to 2. As a re- 
sult of this value for K, 9 reduced to a single real number specifying the 
relative expected size of the two clusters. Expressions were presented for 
p(6'|z,7r,a;. A'), P{z\6,ti,x,K) and p{ti\z ^ 6 ^ x , K) such that the various ele- 
ments Zi (or Tiki) are conditionally independent of each other, given (vr, x, K) 
(or (z,x,-ft')), allowing for a straightforward Gibbs sampler. 

In contrast, we develop an algorithm that searches across the full sample 
space of all possible clusterings, 2;, for all K, drawing from the posterior, 

z, K\x, 

using eq. ^ as the desired stationary distribution of the Markov Chain. 
We use four moves: 

• MK: Metropolis move to increase or decrease K, adding or removing 
an empty cluster. 

• GS: Gibbs sampling on a randomly-selected node. Fixing all but one 
node in z, select a new cluster assignment for that node. 

• M3: Metropolis-Hastings on the labels in two clusters. This is the 



M3 move proposed in Nobile and Fearnside (2007). Two clusters are 



selected at random and the nodes are reassigned to the two clusters 
using a novel scheme fully described in that paper. K is not affected 
by this move. 
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AE: The absorb-eject move is a Metropolis-Hasting merge/split cluster 



move, as described in Nobile and Fearnside (2007). This move does 
affect K along with z. 

At each iteration, one of these four moves is selected at random and at- 
tempted. All the moves are essentially Metropolis-Hastings moves; a move 
to modify z and/or K is generated randomly, proposing a new state {z', K'), 
and the ratio of the new density to the old density pj-^ 'j^!^) = pj-^'!, 'j^! is calcu- 
lated. This is often quite easy to calculate quickly as, for certain moves, only 
a small number of factors in eq. ^ are affected by the proposed move. We 
must also calculate the probability of this particular move being proposed, 
and of the reverse move being proposed. The proposal probability ratio is 
combined with the posterior mass ratio to give us the move acceptance prob- 
ability, 

'^''' {'' P{x,z,K) Pprop((i^,^) ^ iK',z'))J ' ^^^ 

where Pp^op{{K,z) — )■ {K',z')) is the probability that the algorithm, given 
current state {K,z), will propose a move to {K',z'). 

In the remainder of this section, we discuss the four moves in detail, derive 
the proposal probabilities and describe the computational complexity of the 
moves. 

5.1. MK 

The MK move increases or decreases the number of clusters by adding 
or removing an empty cluster. If MK is selected, then the algorithm selects 
with 50% probability whether to attempt to add an empty cluster, or to 
delete one. If it chooses to attempt a delete, then one cluster is selected at 
random; if that cluster is not empty, then the attempt is abandoned. If it 
chooses to attempt an insert, it selects a new cluster identifier randomly from 
{1, . . . , i^ + 1} for the new cluster and inserts a new empty cluster with that 
identifier, renaming any existing clusters as necessary. 

The proposal probabilities are 
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0.5 
K + 1 

|f liK' >l 
otherwise 



By adding an empty cluster, K increases to K' = K +1 and the posterior 
mass change is: 

/ r(a(x+i))na^rK+«) 

Vix.Z.K') K\ \ T{a)K+-^T{N+a{K+l)) 



\^ T{a)'<T{N+aK) 

V{a{K + l))V{N + aK) 



{K + l)T{aK)T{N + a{K + 1)) ' 
The computational complexity of this move is constant. 

5.2. GS 

The Gibbs update move, GS, selects a node i at random to be assigned to a 
new cluster. All other nodes are kept fixed in their current cluster assignment 
i.e. a single element of the vector z is updated. Denote by z' = z^^i^k} the 
modified clustering resulting from a move of node i to cluster k. For each 
possible value oi Zi G {1, . . . , K}, Zi is chosen with probability proportional 
to P{x, Z{zi^k},K). The proposal is then accepted. Bear in mind that this 
move often simply reassigns the node to the same cluster it was in before the 
GS move was attempted. 

The calculations involved in GS are quite complex as many of the factors 
in eq. (|6| are affected. The sizes of the clusters are changed as the node is 
considered for inclusion in each cluster, and the number of edges and pairs 
of nodes are changed in many of the blocks. The computational complexity 
is 0{K^) + 0{N) as every block needs to be considered for each of the K 
possible moves and every node may be checked to see if it is connected or 
not to the current node. 

The 0{N) term is just a theoretical worst case over all possible networks. 
Our algorithm iterates over the neighbours of the current node and this is 
sufficient to perform all the necessary calculations. There is no need to iterate 
over the non-neighbours and therefore the average complexity is equal to the 
average degree, which will be much less than N in real-world sparse networks. 
For small Kk, and assuming a given average degree, the complexity of the 
GS move is independent of A^. 

5.3. MS 



MS is a more complex move and was introduced in Nobile and Fearn 



side (2007). Two distinct clusters are selected at random, j and k. All 
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the nodes in these two clusters are removed from their current clusters and 
placed in a list which is then randomly reordered - call this ordered list 
A = {ai, . . . , a„^+„^}, of size equal to the total number of nodes in the two 
clusters. The software creates a temporary cluster to store these nodes until 
they are reassigned to the original two clusters. One node at a time is se- 
lected from A and is assigned to one of the two clusters according to some 
assignment probability. As the nodes are assigned (or reassigned) the new 
cluster assignments are stored in a list Bh = {61, ... , bh-i}, where bi is the 
new cluster assignment of node Oj and Bh represents the assignments before 
the h^^ node in A is processed. 

Iterating through the list A, a^ is assigned to either cluster j or cluster k 
with probability satisfying 

conditional on the nodes Bh that have already been (re-) assigned. Concep- 
tually, any arbitrary assignment distribution can be chosen, as long as the 
probabilities for each choice are non-zero and sum to one. Once all nodes in 
the list have been assigned to the two clusters, the proposal probability is 
given by 

rij+nk 

pp.op(^ -^z')= n vv''^ . 
h=i 

We remark that while the order in which the nodes are reinserted is 
random, it can be shown that this random ordering does not affect the ac- 
ceptance probability. 



In Nobile and Fearnside (2007), it is proposed to choose the ratio of 
the assignment probabilities as the ratio of the two posterior probabilities 
resulting from the assignments of the first h nodes. Specifically, denote by 
Z{ah^i,Bh}, the clustering that assigns the first h — 1 nodes of A according 
to Bh and assigns ah to cluster I. Let P{x' ,zs^ah-^i,Bh},K) be the posterior 
probability of this clustering on the network x' where all unassigned nodes 
and edges involving these nodes are ignored. Then 

pg;"^' ^ P(x^^|,,^,■B4,ir) 

This heuristic should guide the selection towards 'good' choices. To calculate 
the proposal probability of the reverse proposal, the list A is again traversed 
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to calculate 

h=l 

where B'f^ = {za„...,Za^_,}. 

Our algorithm has been optimized for sparser networks. The complexity 
of MS is made up of three terms. First, it is possible that many or all nodes 
will be reassigned, causing a complexity of 0{N) while updating the data 
structure that records the size of each cluster. Second, we keep a record 
of the number of edges within each block; the M3 move will consider each 
edge in the network at most once, as it moves the edge from one block to 
another, leading to a complexity of 0{M), where M is the number of edges 
in the network. Finally, once the data structures have been updated, a new 
posterior mass must be calculated by iterating over each cluster and over 
each block, querying the summary data structures, to sum the new terms in 
eq. ([6]); this has a complexity of 0{K'^). 

Together, this gives a complexity of 0{N) + 0{M) + 0{K^). The first 
term may be ignored, since for most networks that are considered here and in 
the literature, M > N. As long as the number of clusters is small, K^ ^ M, 
the 0{M) term dominates. While in the worst case M = N'^, in practice, 
for the sparse networks we consider, M <^ A^^. 

5.4. AE 

In the absorb- eject AE move, a cluster is selected at random and split 
into two clusters, or else the reverse move can merge two clusters. This move 
therefore can both change the number of clusters K and change the clustering 
z. The move will first choose, with 50% probability, whether to attempt a 
merge or split. 

In the case of the split move, one of the K clusters is selected at random. 
Also, the cluster identifier of the proposed new cluster is selected at random 
from {1,...,A' + 1}. Finally, the nodes in the original cluster are assigned 
between the two clusters. This is similar to the MS move and a heuristic to 



guide the assignment, as in MS", could be considered. Instead, as in Nobile 



and Fearnside| ( 2007[ ), we use a probability of ejection, pe, selected randomly 



from a Uniform(0, 1) distribution, such that each node is assigned to the new 
cluster with probability, pe- In such as move, the proposal probability is 
dependent on pe- Rather than specify an ejection probability, we integrate 
over the choice of pe in much the same manner as collapsing. 
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Given {z, K) and a proposal to split into {z\ K' = K + 1), where a cluster 
of size Hk is split into clusters of size rij^ and rij^, the resulting proposal 
probability for an eject move is 

p ((z K) ^ (z' K')) = rK + i)r(n,, + 1) 

For a merge, the proposal probability is simply obtained as the probability 
of selecting the two clusters for merger from the K' = K+1 possible clusters. 
One cluster is selected which will retain its current nodes and which will 
expand to contain the nodes in another, randomly selected, cluster. 



KK + 1 
The complexity is similar to that of the M3 move. 

5.5. Applying the moves 

In all simulations, discussed in section [6| the algorithm is seeded by ini- 
tializing K = 2 and assigning the nodes randomly to one of the two initial 
clusters. The first two moves, MK and GS, are sufficient to sample the space 
but have slow mixing. The AE move is sufficient on its own as it can add 
or remove clusters as well as move the nodes to reach any {z, K) state. In 
practice, we'll see in section |6] that the combination of AE and M3 is good 
in the initialization stages to burn-in to a good estimate of both z and K 
and lessen the dependence on the initialization. It is possible to envisage 
many possible extensions to these moves. For example, a form of M3 could 
be made which selects three clusters to rearrange. The AE move could be 
extended to include the assignment heuristic of the MS move. 

5.6. Label Switching 

For any given z, with K clusters (assuming they are all non-empty), there 
are K\ ways to relabel the clusters, resulting in K\ effectively equivalent 
clusterings. The posterior has this symmetry and as the MCMC algorithm 
proceeds it often swaps the labels on two clusters, in particular during the 
MS move. This is known as the label switching phenomena. The posterior 
distribution for any Zi\x,K assigns node i to each of the K clusters with 
probability -^, so in the long run every node is assigned with equal prob- 
ability to every cluster. While each Zi is uniformly distributed between 1 
and K, the components of z are dependent on each other and pairs of nodes 



that tend to share a cluster will tend to have the same values at their cor- 
responding component of z. Depending on the context, this may not be an 
issue of concern. For example, if the aim is to estimate K or to estimate the 



probability of two nodes sharing the same cluster, see fig. 3(b) , or to estimate 
the size of the largest cluster, then label switching is not a problem. 

However, it sometimes is desirable to undo this label switching by rela- 
belling the clusters, such that nodes are typically assigned to a single cluster 
identifier along with those other nodes that they typically share a cluster 
with. Such a relabelling can, for example, make it easier to identify the 
nodes which are not strongly tied to any one cluster. 



We use the algorithm in Nobile and Fearnside (2007) to undo the label 



switching by attempting to maximize the similarity between pairs of cluster- 
ings, after the burn-in clusterings have been discarded. Given two z vectors, 
at two different points in the Markov Chain, t and u, define the distance 
between them to be 



N 



D(.w,.(«)) = j:/(.f),.f"^: 



where / is an indicator function that returns if node i is assigned to the 
same cluster at point t and point u\ and returns 1 otherwise. 

For each z^^^ ^ consider 2;*^**\ one of the K\ possible relabelled versions of 
z^*^ The Markov Chain is run for a iterations, discarding the first h iterations 
as burn-in. 

Ideally, the goal is to find the relabelling that minimizes the sum over all 
pairs of u and t, 

j^Y.D{z^^'\z^^-y), 

t=b u=t+l 

but it is not computationally feasible to search across the full space of all 
relabellings. Each state can be relabelled in approximately K\ different 
ways, the precise number depends on the number of non-empty clusters. 
There are a states altogether, therefore the space of all possible relabellings of 
all states will have {K\)°- elements; this will be untractable for non-negligible 
a. In our experiments, a tends to be of the order of one million. 



Instead, we use the online algorithm proposed in Nobile and Fearnside 



(2007). It first orders the states from the Markov chain by the number of non- 



empty clusters. Then, it iterates through the states, comparing each state to 
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all the preceding relabelled states and relabelling the current state such that 
the total distance to all the preceding relabelled states is minimized. 

We will see in section [7] how this algorithm helps to summarize the output 
of the Markov Chain. This algorithm is fast. On a 2.4 GHz Intel Zeon in 
a server with 128GB RAM, it takes 43 seconds to process the output of 1 
million iterations of that data. In comparison, it takes 610 seconds to run 
the SBM MCMC algorithm in order to get the states to feed into the label- 
unswitching algorithm. Note also that the algorithm doesn't take up much 
memory — even with a network with 10 million edges, the memory usage 
doesn't exceed 2GB. 

Once the label-switched set of states is obtained, a posterior distribution 
of the clustering for each node, Zi\x, can be calculated. There is a similarity 



here with variational methods Daudin et al. (2008); Latouche et al. (2012) 



as they model the posterior in this manner, where each node's variational 
posterior is independent of the other nodes' variational posterior. It may 
be interesting to compare these approximate posteriors to the approximate 
posterior found by our method. 

In the experiments we perform later in sections |6] and [7| the vast majority 
of nodes are strongly assigned by this label-switching algorithm to one of 
the clusters with at least 99% probability in the posterior. Therefore, the 
distance -D(., .) between each state and this 'summary' state is usually quite 
small. We take this as an indication that the online heuristic has done a 
reasonable job of minimizing the distance between the states, at least for 
those networks. 

6. Evaluation 

In this section we first look at experiments based on synthetic data and 
follow in the next section with an application of the collapsed SBM to a 
survey network gathered by one of the authors at a recent summer school. 
The synthetic analysis proceeds by generating networks of various sizes from 
the model and examining whether the algorithm can correctly estimate the 
number of clusters and the cluster assignments. 

As mentioned in the previous section, all our experiments are done on a 
2.4 GHz Intel Zeon in a server with 128GB RAM, and the memory usage 
never exceeded 2GB. 
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6.1. Estimating z 

A 40-node directed, unweighted network is generated from the model, 
containing 4 clusters of 10 nodes each. The block densities iiki are generated 
by drawing from a Uniform(0, 1) = Beta(l, 1) for each of the 4 x 4 = 16 
blocks. 




5 10 15 20 25 30 35 



Figure 1: The adjacency matrix (with S — 0) for the four-cluster synthetic network used 
in section \6A\ Each of the four clusters has 10 nodes. 



To challenge the algorithm further we add noise to the synthetic data, 
similar to simulation experiment described in section 4 of |Wyse and Friel 
(2012). The values in the matrix vr are scaled linearly. For a given 6, define 



Tij^i = S + 7Cki{l — 25). While the values in the original vr are drawn from the 
full range, [0, 1], the elements in the matrix vr^*^^ are in the range {6, 1 — 6). 
Various networks for values of 6 between and 0.5 are generated. The original 
network model corresponds to 5 = 0. The network with S = 0.5 corresponds 
to an Erdos-Renyi model with p = 0.5 — this is a random graph model with 
no block structure. 

The algorithm is run for one million iterations, discarding the first 500,000 
of these as burn-in. 
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Table 1: The performance decreases as the noise level, S, increases. The fifth column, 
P(z = z\x), reports how often the sampler visits the 'correct' answer; i.e. where the 
visited state was equivalent, subject to relabelling, to the model from which the network 
was generated. 

Table [l] shows how the performance is affected as 6 increases. The first 
column is the posterior probability for the "correct" answer for K, P{K = 
4|x). As the value of 6 increases, the network approaches the Erdos Renyi 
model and therefore there is no longer any structure to detect; this explains 
why the accuracy decreases as 6 increases. Next is the modal value of K 
which maximizes the posterior P{K\x), followed by the posterior probability 
of the modal value, P{K = i^modela^) • The fifth column, P{z = z\x) is 
the probability that the (non-empty) clusters are equivalent (allowing for 
relabelling) to the clustering used to generate the data. Note that sometimes 
there are empty clusters in the estimate and therefore P{z = z\x) can be 
bigger than P{K = 4\x). 

The final column reports r, the Integrated Autocorrelation Time (lAT) 
for the estimate of K, defined as r = 1 + 2 Ylt^i P{^)i where pt is the autocor- 
relation at lag t. As the sampler visits the states, we consider how correlated 
the estimate of K is with the estimates for preceding states. A low autocor- 
relation, as summarized by the lAT, is an indicator of good mixing. 

6.2. Estimating K 

We perform three different types of experiments to judge the ability of 
the algorithm to correctly estimate the number of clusters with networks of 
increasing size. 
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First, we repeat the experiments of Latouche et al. (2012). The true 
numbers of clusters, Ktme is set to range from 3 to 7. For each Ktme, 100 
networks are randomly generated. The number of nodes in each network, 
N, is set to 50. The nodes are assigned to the clusters randomly, with 
6i = ■ ■ ■ = 6k = jr" — • Two parameters are used to control the density of 
the blocks. The first. A, is the density within clusters i.e. n^k = A. Also, 
one of the clusters is selected to be a special cluster of 'hubs', well connected 
to the other nodes in the network, by setting nik = i^ki = A. The second 
parameter, e, represents the inter-block density of all the other blocks i.e. 
TTki = e for k,l ^ 1. As in the experiments of Latouche et al. (2012), the 
parameter values are A = 0.9, and e = 0.1. 
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Table 2: The rows represent Ktme and the columns arc the estimates from the ILvb of 



Latouche et al. (20121 and from our algorithm. 



Each network is run through the variational method of Latouche et al. 
(2012). The estimated value of K which maximizes the ILvB measure is 



taken as the estimate of the number of clusters. A contingency table, showing 
the true number of clusters against the estimate from ILvB, is displayed in 
table 2|(a) For low Ktme the algorithm is very accurate, and for larger values 
there is a tendency to underestimate the number of clusters. For example, 
when Ktrue = 7, the estimate was K = 6 for 41 of the networks and K = 7 
for only 25 of the 100 networks. 

The results from our algorithm, shown in table [2 [(bj are similar to those 
obtained using the ILvB. 

6. 3. Synthetic SBM networks 



The experiments of section 6^ involve synthetic data generated according 
to a model of community structure, where edges tend to form primarily within 
clusters. In order to explicitly test our algorithm in the more general setting 
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of block structure, we generated another set of networks with data generated 
directly from the SBM. 

Similarly to the previous experiment, for each of a range of values of Ktme, 
100 networks are generated. KTme is now set to range from 10 to 20 and the 
number of nodes is set to A^ = 100, in order that the size of each cluster not 
be very small. Each element of T^ki is chosen randomly from Uniform(0,l) 
and for each of the 100 networks, a new vr is created randomly. As these 
are undirected networks, only the upper triangular portion of vr is used when 
generating the network. Again, we compared the estimates of K found by 
the ILvB to those found by our algorithm. 
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Table 3: The true number of clusters (rows) against the number estimated by ILvh 
(columns). The diagonal entries are underlined to aid readability, as these represent the 
correct answer. We see here a tendency to underestimate the number of clusters, especially 
for larger Kxrue- 

The results are shown in tables [3] and |4j Each row of data represents 
the 100 networks generated for a given Kxrue- Each column represents the 
estimated K. Ideally, the algorithm would correctly estimate the number of 
clusters in most cases, corresponding to large number on the diagonal. We 
have underlined the diagonal entries for clarity. Note that the sum of the 
entries in each row does not always sum exactly to 100, since there are cases 
where the algorithms underestimate or overestimate the number of clusters, 
beyond the shown range. For the ILvb algorithm, it is necessary to specify 
a range of K to be tested; we specified the range from 5 to 30. Our MCMC 
algorithm requires no such hint. 
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Table 4: The true number of clusters (rows) against the number estimated by our collapsed 
MCMC algorithm (columns). The diagonal entries are underlined to aid readability, as 
these represent the correct answer. The accuracy is better here than in table [31 we can 
see that the numbers on the diagonal are larger. 



For networks with a small number of clusters, both algorithms perform 
well, with 72% accuracy for ILvb and 95% accuracy for our algorithm. As the 
true number of clusters increase, the performance decreases. Our algorithm 
maintains at least 50% accuracy in all cases, whereas the accuracy for ILvb 
falls to 23%. When they are incorrect, both algorithms have a tendency to 
underestimate the number of clusters. 



In section 6.4, a more thorough investigation of the speed and scalability 
of our algorithm with respect to larger networks is given but we close our 
comparison with ILvb with some remarks on speed. For the first set of small 
networks above, both methods are very fast; they complete within seconds. 
For example, the ILvb can be calculated for all values of K from 10 to 20 in a 
total under five seconds. We have not defined a convergence criterion for our 
MCMC algorithm, and therefore we make no attempt to halt the sampling 
early in order to define a 'runtime' for our algorithm. But in the occasions 
where both methods get the correct result, our algorithm typically reaches 
the correct result within nine seconds; and the sampler remains at, or very 
close to, the correct clustering for the remainder of the run. 

Finally, to demonstrate the importance of the AE move, in fig. |2| the 
time taken by our algorithm to reach the correct clustering for three synthetic 
networks is shown. The numbers of clusters in the networks is 5, 20, and 
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Figure 2: The estimates of K in the synthetic networks, with K — 5, 20, 50. The x-axis 
(logarithmic scale) is the number of iterations; as the algorithm proceeds, in each case it 
converges on the correct estimate of K. The networks had 10 x K nodes each. In the lower 
plot, we see the performance where where the AE move has been disabled; demonstrating 
how it is important in burnin. 

50 respectively, with vr drawn from Uniforni(0, 1). In each case, there were 
exactly 10 nodes in each cluster, giving A^ = 10 x i^ nodes in each network. 
The X-axis displays the number of iterations and the y-axis the number of 
clusters at that stage in the run of the sampler. The correct clustering is 
reached in approximately 10,000 iterations. 

We found that the AE move is quite important, at least in the early 



stages. If AE is disabled, see fig. 2(b) , then it takes about 320,000 iterations 
for K=50, instead of just 20,000 iterations when all moves are in effect. 
For fast burn-in, M3 and AE are necessary. With similar experiments we 
noticed that, once the chain has burned in, the MS move is sufficient for 
good performance and the other simple moves, GS and MK, do not make 
major contributions. 

6.4- Larger networks 

Next, we investigate larger networks to demonstrate the scalability of the 
algorithm. 

A number of synthetic networks are generated, each with approximately 
ten thousand nodes and ten million edges. The number of clusters ranges 
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from 3 to 50, and the number of nodes in each clusters, O, is set such that 
the total number of nodes, A^ = i^ x O, is close to 10,000. If we use the 
default SBM edge model, then the number of edges would be approximately 
50 million. As this would take up a lot of computer memory, instead we 
modify the prior for the per-block densities to be Uniform (0,0. 2) in order to 
ensure that the expected number of edges is 10 million. Large real-world 
networks are typically quite sparse, even more sparse than this synthetic 
network. The details, including the speed and accuracy, are in table [5j 

The SBM algorithm is run for 100,000 iterations on each of these networks 
and the time to converge is recorded. In each case, when the algorithm 
first visits the 'correct' state, it remains in that state for practically all the 
remaining iterations. We record the number of iterations taken before the 
algorithm reaches the correct state, and the time that has elapsed at that 
point. It typically converges within one hour, but it takes nearly four hours 
for the 50-cluster network. Methods that scale to thousands of nodes have 



been presented in the literature, such as Daudin et al. (2008) and Latouche 



et al. (2012). To our knowledge, ours is the only method which has been 
demonstrated on networks with 10,000 or more nodes. 

We have attempted to load these networks into the R software package 
in order to run them through ILvb. However, the memory requirements 
for such large adjacency matrices become prohibitive. For large networks, 
it may be necessary to consider a different implementation language and 
techniques in order to fully explore the scalability of a variational method 
such as ILvb. Instead, we generated five 500-node networks, with 20 clusters 
each, according to the SBM model and run ILvB on it, using only one value 
of K, namely K = 20. It takes between 38 and 56 seconds, depending 
on which of the five networks is used. In comparison, on the same data, 
our algorithm takes between 17 and 35 seconds, despite the fact that it is 
given no clue as to the correct value of K. With 1,000- node networks, the 
runtimes for ILvb are between 636 and 814 seconds, whereas our algorithm 
takes between 55 and 78 seconds. This suggests our algorithm scales better 
than the ILvb - although perhaps this is an implementation issue rather than 
a limitation of the variational model. 

In practice, it is necessary to run ILvb for every possible value of K, and 
this fact should be incorporated into any evaluation of its runtime. For larger 
networks, the range of possible values of K increases making this a significant 
issue. In contrast, an algorithm based on the allocation sampler, such as 
ours, does not suffer this limitation, suggesting that that our algorithm is 
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1,667 
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7 
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3,710 
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1,111 


9,999 


9,581,440 


1,383 


4,052 
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1,000 


10,000 


9,989,886 


1,277 


3,785 


20 


500 


10,000 


9,871,938 


5,655 


4,779 


30 


333 


9,990 


9,821,594 


12,497 


6,999 


40 


250 


10,000 


9,862,703 


37,742 


12,452 


50 


200 


10,000 


10,008,963 


40,958 


24,028 



Table 5: The time-to-convergence for the larger synthetic networks. The networks have 
N = KxO nodes, made up of K clusters each with O nodes. After i iterations {t seconds), 
the algorithm reached the correct result and remained in, or close to, that state for the 
remainder of the 100,000 iterations. It should be noted that much of the runtime is simply 
taken up with loading the network into memory; the time spent in the MCMC algorithm 
itself is smaller than the t figure presented here. 

well suited to large networks. 

6.5. Autocorrelation in K 

An autocorrelation analysis can reveal the mixing properties of the algo- 
rithm. However, in the above examples, and in the survey data discussed in 
section [7], the estimates of K are very much peaked around a single value. 
Often the larger values of K are associated with empty clusters and the esti- 
mate of the number of non-empty clusters is even more peaked. This makes 
it difficult to use K as an interesting variable on which to perform autocorre- 
lation analysis. To address this, we examine the 6-node network in fig. 3(a)| 



for which a greater variance in the values of K is observed. Define Ki to be 
the number of non-empty clusters, Ki < K. The posterior predictive prob- 
ability for f^ = 2 is 57.0%, and for K = 3 it is 31.4%. For the non-empty 
clusters, it is 73.4% for Ki = 2 and 24.4% for Ki = 3. The autocorrelation 



in the estimates of K is shown in fig. 3(c 



The acceptance rates on this small 6-node network are relatively high: 
8.1% for MK, 4.2% for GS, 20.5% for AE, 46.0% for M3 . We'll see lower 
acceptance rates in the next section when the algorithm is applied to the 
survey network. 
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ity of two nodes sharing a cluster. 



(c) Autocorrelation on K. 



Figure 3: Adjacency matrix used in the analysis of varying K in section 6.5 Figure [3(b)| 
estimates, for every pair of nodes, the predicted probability of them sharing a cluster. 



Figure 3(c) shows the autocorrelation in the estimate of K. 
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Figure 4: The interation survey network of sectionlTj Node-to-cluster membership matrix. 
74 rows, one for each participant. There are 8 columns, one for each of the main seven 
clusters, and an extra cluster which, with very small probability, is occupied by some 
nodes. Most nodes are strongly assigned to one cluster, but the grey areas off the diagonal 
show a small number of nodes that are partially assigned to multiple clusters. 



7. Survey of interaction data 

A survey was performed by a team involving one of tlie autliors of tliis 
paper at a summer scliool. We asked tlie 74 participants to fill in a survey 
and record which other participants they knew before the summer school and 
also which participants they met during the school. 40 of the participants 
responded and gave us permission to make their survey response available 
publicly in anonymized format. We created a directed, unweighted, network 
from the data by linking A to B if A recorded either type of relationship 



with B, resulting in 1,138 edges. This network data is available at https: 
[//github . com/aaronmcdaid/Summer-school-survey-network, 



Using the procedure described in section 5.6 , we are able to summarize the 
output of the Markov chain in fig. |4j This is a matrix which records, for each 
(relabelled) cluster and node, the posterior probability of that participant 
being a member of that cluster. Each row represents one participant of 
the summer school, and the total weight in each row sums to 1.0. We have 
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ordered the rows in this figure in order to bring similar rows together, helping 
to highlight the sets of nodes which tend to be clustered together in the 
Markov Chain. As may be observed, most of the participants are strongly 
assigned to one cluster. Every node is assigned to one of the clusters with at 
least 75% posterior probability, and the majority of nodes have at least 99% 
posterior probability. 
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Figure 5: The interaction survey network of section IT] as a 74x74 adjacency matrix for 
the 74 participants in the summer school. 7 clusters were found by our method, and this 
matrix is ordered by the summary clustering found by the label-unswitching method of 
section [5^ In the text in section [7j we interpret the clusters found and show how many 
of the clusters correspond to the different types of people that attended the event. There 
were 33 people who did not respond, these can be seen in the last two clusters. 

The number of clusters selected is 7, with 90.7 % posterior probability. 
We can summarize this into a single clustering by assigning each node to its 
'best' cluster as found by the label-unswitching procedure. In fig. [5} we see 
this clustering. This particular clustering (or label-switched equivalents) has 
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posterior probability of 20.7%. (The order in which the clusters are presented 
is different in fig. [s] than in fig. |4]) 

We then analyzed the clusters to see if they could be meaningfully inter- 
preted. The first thing that stands out is that the final two rows of blocks 
are empty; these are simply the 33 people who did not respond to the sur- 
vey. It is interesting to see that the non-respondents have been split into two 
clusters. Looking at the final two columns of blocks, the differences in how 
other clusters linked to the non-respondents can be seen. 

With the help of one of the organizers, we verified that the second cluster 
(counting from the top, or from the left) is made up of the Organizers of the 
summer school, with one exception. These were people based in the research 
institute who were involved in organizing the summer school. Therefore, it is 
no surprise that the corresponding rows and columns of the adjacency matrix 
in fig. [5] are quite dense. The Organizers interacted with almost everybody. 

The third and fourth clusters are also made up of people who are based in 
the research institute where the summer school was hosted but who weren't 
on the programme committee. We call these Locals. The first cluster is made 
up of Visitors. These were people from further afield who attended the school 
and spoke at the summer school. Looking at the blocks at the top left of fig. [5] 
you can see that the Locals know each other and the Visitors interacted with 
each other. But the two groups do not tend to interact strongly with each 
other. The Organizers are the glue that hold everybody together. The fifth 
cluster appears simply to be made up of participants who did not interact 
very much with anybody - in fact they did not even interact with each other. 

We can now interpret the fact that there are two clusters of non-respondents. 
One of those clusters (the sixth cluster) is made of up of local people. Their 
names appeared in the surveys of the Organizers and Locals. The final clus- 
ter, the other non-respondent cluster, is made up of a broader range of people. 
It includes many non-responder Visitors, including many of the speakers at 
the summer school. 

A community finding algorithm would not have been able to find these 
results, as it would expect to find dense clusters and is tied to the assumption 
that the probability of pairs of nodes being connected is, all other things being 
equal, greater if they share a cluster than if they do not share a cluster. This 
would manifest as dense blocks on the diagonal of this adjacency matrix. 
Clearly, a community-finding algorithm could not find the non-respondent 
clusters. Also, a community finding algorithm might have merged the Or- 
ganizers and Locals clusters. This is because those two clusters are quite 
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dense internally and also have many connections between them. The only 
difference between these two clusters is how they interact with the rest of 
the network; this demonstrates how the rich block structure of the Stochastic 
Block Model, including the various cluster-cluster interactions, can be helpful 
in clustering this data. 

We ran the algorithm for 1 million iterations on this survey data, dis- 
carding the first 500,000 iterations as burn-in. The acceptance rates were as 
follows: 2.3% for AE, 64.6% for M3, 1.1% for MK. In the case of the Gibbs 
sampler, 2.5% of the time it assigned a node to a new cluster, otherwise the 
node was reassigned to its old cluster. 

The MS and AE are both Metropolis-Hastings; a change to the clustering 
is proposed and then the change is accepted or rejected. Sometimes the 
accepted move actually places all the nodes back to the same position they 
were in, or sometimes it merely swaps the labels between the two clusters. 
If we consider these as 'rejections', then the rate and which new states are 
reached is just 1.0% for MS. So, MS is accepted a lot, but it usually only 
moves between label-switched equivalents; this tells us that the algorithm is 
able to move quickly between the various modes of the distribution, and also 
suggests that the posterior is quite peaked around the modes. 

7.1. Estimating the Network Probability, P{x) 

In Section 4, we discussed how the fully Bayesian approach of the SBM 
presented here allows model selection criteria such as the ICL to be avoided 
to select between models with different input numbers of clusters K. It 
is also worth remarking that in certain circumstances, such as our survey 
data presented here, it is possible to compute an estimate of the network 
probability, P{x); that is, the probability, given just the total number of 
nodes A^, that the network x is observed from an SBM. This provides an 
absolute measure of the fit of the SBM to the observed data and could be 
used to test the hypothesis that the data is drawn from an SBM against some 
alternative model. 

In the survey data there is one clustering where it, along with its label- 
switched equivalents, take up 20.7% of the posterior probability. Call this z. 
Thus we have a value z which is visited very often by the sampler and this 
allows an accurate estimate of P{K, z\x) to be obtained using 

7! X P{K = 7,z = z\x) = 0.207. 
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Now inserting x, K and z into the expression for the joint distribution, 
an estimate of P{x) can be obtained using 

V{x)V{K = 7,z = z\x) = P(x, z = z,K = 7). 

In the case of the survey data we obtain log2P(x) ~ —2,482. To put 
some perspective on this value, we can compare with a model that selects x 
uniformly at random from all possible directed networks over N = 74 nodes. 
In this case, we obtain log2P(a;) = —N{N — 1) = —5,402. As a second 
alternative, if x were generated from an Erdos-Renyi model, averaged over 
all possible edge probabilities drawn uniformly at random, then log2P(x) ~ 
-4,130. 

8. Conclusion 

The original stochastic blockmodel was tested on a small network with 
two clusters. We have shown how Bayesian models, collapsing, and modern 
MCMC techniques can combine to create an algorithm which can accurately 
estimate the clusters, and the number of clusters, without compromising on 
speed. 

It is sometimes stated that MCMC is necessarily slower than other meth- 



ods, "effectively leading to severe size limitations (around 200 nodes)" (Gazal 



et al. , 2011). The MCMC method we have presented scales to thousands of 
nodes, and is more scalable than a recent variational method. We do not 
claim that MCMC will always be necessarily faster than the alternatives, but 
we observe that comments on the scalability of Metropolis-Hastings MCMC 
depends on the particular model and on the particular proposal functions 
used. It may be an open question as to which methods will prove to be most 
scalable in the long term, as further improvements are made to all methods. 
Our application to the survey data demonstrated that block-modelling 
can detect structure in networks that might be missed by community-finding 
algorithms. Sometimes the links between clusters are more interesting than 
the links within clusters. 
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Appendix A 

Here, we describe tlie integrations which show that eq. Q is equivalent 
to eq. (@. 

A.l Collapsing 9 

Here, we show how to calculate 

V{z\K) = [ p{z,e\K)d9. (10) 

Je 

This corresponds to the first integration expression in eq. ([s]). z is a vector 
which records, for each of the A^ nodes, which cluster it has been assigned 
to. The probability for each cluster is in a vector 6, where 



K 



fc=l 
We integrate over the support of the Dirichlet distribution, which we have 



denoted with 9 in eq. ( 10 ) 



6 ~ Dirichlet(a, a, . . .) . 

where we made the common simplification in our prior that all members 
of the vector a are identical; ak = a. 
9 is drawn from Dirichlet prior, 

^ ' fc=i 

where the normalizing constant B(a;) is 

ntiTM 



B a 



r (nf=i "fc 



To collapse 9, the expression for 'P[z\K) becomes the Multivariate Polya 
distribution. In the derivation, we have defined Uk to be the number of 
nodes in cluster k, i.e. 



nk = Y^ 



' 1 a Zi = k 



iizij^k ■ 
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In the following expression, we will also find it useful to define another 
vector of length K, 

a = (tti + ni, a2 + n2,...,aK + uk) , 



[ piz,e\K) d9= f p{9\K)F{z\e,K) dO 
Je Je 

= /p(^i^)nc 

■^•J ^ ' k=l k=l 



K 

"=d^ 

K K 



r^ n C^"'"' d^ 



e B(a, ^^^ 



K 



B(a') 



fc=i 



B a 



r(Ef=i«fc) -n- rK + afc) 
"r(iv + Ef=i«.)M TK) 

Y(A^ + fs:a) JJj r( 



a) 



A. 2 Collapsing it 

Now we look at the second integration expression in eq. (tsl) . This describes 
how to calculate the probability of a network, x, given a clustering, z, and 
the number of clusters, K. 



P{x\z,K) = / P{x,n\z,K) dvr. 
in 



This depends on whether we're using the unweighted (Bernoulli) or integer- 
weighted(Poisson) model for edges. It is also possible to allow real-valued 
weights with a Normal distribution and suitable priors, an example of such 
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a model is solved in Appendix B.2 of Wyse and Friel (2012); that paper is 



relevant for all the derivations here as the collapsing approach is quite similar 
as in this paper. 

The number of pairs of nodes in block between clusters k and I will be 
denoted pki - for blocks on the diagonal pkk will depend on whether the edges 
are directed and on whether self loops are allowed; see eq. ^ for details. 
The relevant probabilities for a given block will be shown to be a function 
only of pki and of the total weight (or total number of edges) in that block. 
We'll denote this total weight as 

Vki = 5Z ^'^ ■ 

i,j\zi=k,Zj=l 

In an undirected graph, we should consider each pair of nodes only once, 

Vki = 5Z ^^J • 

i,j\i<j,Zi=k,Zj=l 

We are to calculate the integral for a single block. X(^ki) represents the 
sub matrix of x corresponding to pairs of nodes in clusters k and /. Our goal 
is to simplify the expression such that there there is one factor for each block, 



F{x\z,K) = l[Fixiki)\z,K) 

= Y\ 'P{x{ki),TTki\z,K) dir 



For directed graphs, the product is Ylki^ giving K x K blocks. But for 
undirected graphs, the product is nfc«|fc<Z' giving \K{K + 1) blocks. The 
domain of the integration will be either j^ or j^ , depending on which of the 
two data models, unweighted or weighted, is in effect. 

We'll first consider the unweighted (Bernoulli) model. The probability of 
a node in cluster k connecting to a node in cluster / is constrained by 

< TTfc, < 1 , 

and each element of X{^ki) is drawn from a Bernoulli distribution with param- 
eter TTkU 

V{xikl)WuZ,K) = TT^f (1 - T^klY'^-'^'' . 
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The prior for n^i is a Beta(/3i, /32) distribution. 

F{x(^ki)\z,K) = / p{x(^ki),'n-M\z,K) diTki 
Jo 
• 1 

p{nki) P{x(^ki)\TTki,z,K) d-Kki 

'0 



B(/3i,/32) 



B(/3i,/32) 

B(i/h + /3i,Ph -yki + (32) 



dvr 



fci 



B(/3i,/3: 



2 



^1 y..+/3l-l(l_ )p,,-,,,+ft-l 
X / ^=77 ^7^ r^T^ dTTfcz 



/o B(|/H + /3l,PH-l/fc/ + /32) 
_ B(|/fc; + f3i,Pkl - Vkl + /32) 

B(/3i,/32) 

where B(/3i, /32) = pfg Vg ) ^^ ^^^ Beta function. This result is closely related 
to the Beta-binomial distribution. 

Next, we'll consider the Poisson model for edges in more detail. Again, 
we will see that ph and yi, are sufficient for P{x(ki)\K, z). 

In this integer-weighted model, an edge (or non-edges) between a node in 
cluster k and a node in cluster / gets its weight from a Poisson distribution 

Xi\-nki ~ Poisson(7rM) , 

and TTfcz > 0. 

This gives us, iterating over the pairs of nodes in the block. 



P{xi^ki)\T^kuZ,K)= II -^exp(-7rfc,) . 

i,j(ik,l ■' 

We can combine this expression for every block. 



38 



F{x\7!-, z,K) = Y\_'Pix(^ki)\Trki,z,K) 



kl 



n n ?T'=^p<-''") 

kl i,j£k,l •' 

n^nn 



^fc/ exp^-TTHJ 



ij ■' kl i,j&k,l 

We can ignore the Ylij ~^\: as one of those will be included for every pair 
of nodes in the network. That will contribute a constant factor to eq. d6]); 
this factor will depend only on the network x, and it will not depend on 
i^ or z or any other variable of interest, and hence we can ignore it for the 
purposes of eq. ([6]). Therefore, for our purposes we will be able to use the 
following approximation in the derivation 

P{x(^ki)\T^khZ,K) = Yl TT^f exp(-7rfc;) . 

i,j&k,l 

We'll place a Gamma prior on the rates, 

TVh ~ Gamma(s, 0) . 



/•oo 

F{x(^ki)\z,K) = / p{x(ki),'^ki\z,K)d7rM 
Jo 

^kl 
i,j£k,l ■' 

1 /"°° Q--^kl/(p 






^-'^ "^0 ^ '"^ i,jek,l 

°^ p--^ki/4> 



oc / -Kl, ^^r^^— JJ vr^'f e """diXki 



'kl 







rs 



^s 



i,j£k,l 



~ .-1+j:.,, exp(-7rMp« - ^) 

ff^W ^"- 
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We said earlier that we'll define i/ki = J2ij&ki-'^ij- We'll now substitute 
that in and also use the following definitions: 

s' = s + Vki 
1 1 

Where Gamma(s, 0) was the prior on tt^, Gamma(s', </)') is the posterior 
now that we have observed edges with total weight yki between p^i pairs of 
nodes. Returning to /, and rearranging such that we can cancel out the 
integral (because it is the integral of a Gamma distribution and hence it will 
equal 1), 

exp(-^^ 



f{x(^ki)\z,K)= I n(i ^ ^: ^^ ' d-Kki 



r(s)0^ 

r(.)0^ io ""'' ns'W '' 



Vis 



l\Als' 



Vis) 



r(s + Vki) 



Pkl+l 



s+Vkl 



Vis) 
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